Language Models, Smoothing, and IDF Weighting

Authors

  • Najeeb Abdulmutalib
  • Norbert Fuhr
Abstract

In this paper, we investigate the relationship between smoothing in language models and idf weights. Language models consider the relative within-document frequency and the relative collection frequency of a term; idf weights are very similar to the latter, but yield higher weights for rare terms. Examining the correlation between the language model parameters and relevance for two test collections, we find that the idf type of weighting seems to be more appropriate. Based on the observed correlation, we devise empirical smoothing as a new type of term weighting for language models, and retrieval experiments confirm the general applicability of our method. Finally, we show that the most appropriate form of describing the relationship between the language model parameters and relevance seems to be a product form, which confirms a previously proposed language model.
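
The quantities named in the abstract can be illustrated with a minimal Python sketch. It uses common textbook definitions (maximum-likelihood relative frequencies, idf as log(N/df), and linear Jelinek-Mercer interpolation), which are assumptions here: the abstract does not state the exact estimators used in the paper, and this is not the authors' empirical smoothing method. The toy documents, the function names, and the parameter `lam` are invented for illustration.

```python
import math

# Toy collection, invented for illustration only.
docs = [
    "language models smooth term probabilities".split(),
    "idf weights favour rare terms".split(),
    "language models and idf weights".split(),
]

def rel_doc_freq(term, doc):
    """Relative within-document frequency: tf(term, doc) / |doc|."""
    return doc.count(term) / len(doc)

def rel_coll_freq(term, docs):
    """Relative collection frequency: cf(term) / total tokens in the collection."""
    total_tokens = sum(len(d) for d in docs)
    return sum(d.count(term) for d in docs) / total_tokens

def idf(term, docs):
    """Inverse document frequency, log(N / df). Assumes the term occurs somewhere."""
    df = sum(1 for d in docs if term in d)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log(len(docs) / df)

def jm_smoothed(term, doc, docs, lam=0.5):
    """Jelinek-Mercer smoothing: interpolate document and collection estimates.

    lam is a hypothetical smoothing weight in [0, 1]; 0.5 is an arbitrary default.
    """
    return (1 - lam) * rel_doc_freq(term, doc) + lam * rel_coll_freq(term, docs)

if __name__ == "__main__":
    for term in ("language", "rare"):
        print(
            term,
            round(rel_coll_freq(term, docs), 4),
            round(idf(term, docs), 4),
            round(jm_smoothed(term, docs[0], docs), 4),
        )
```

In this toy collection, "rare" appears in fewer documents than "language", so it receives the larger idf value, while the smoothed estimate gives it a non-zero probability even in a document that does not contain it.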

Similar resources

Cumulative Progress in Language Models for Information Retrieval

The improvements to ad-hoc IR systems over the last decades have been recently criticized as illusionary and based on incorrect baseline comparisons. In this paper several improvements to the LM approach to IR are combined and evaluated: Pitman-Yor Process smoothing, TF-IDF feature weighting and model-based feedback. The increases in ranking quality are significant and cumulative over the standa...

Axiomatic Analysis of Smoothing Methods in Language Models for Pseudo-Relevance Feedback (thesis by Hussein Hazimeh)

Pseudo-Relevance Feedback (PRF) is an important general technique for improving retrieval effectiveness without requiring any user effort. Several state-of-the-art PRF models are based on the language modeling approach where a query language model is learned based on feedback documents. In all these models, feedback documents are represented with unigram language models smoothed with a collecti...

Part of Speech Based Term Weighting for Information Retrieval

Automatic language processing tools typically assign to terms so-called ‘weights’ corresponding to the contribution of terms to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informati...

A novel term weighting scheme based on discrimination power obtained from past retrieval results

Term weighting for document ranking and retrieval has been an important research topic in information retrieval for decades. We propose a novel term weighting method based on a hypothesis that a term’s role in accumulated retrieval sessions in the past affects its general importance regardless. It utilizes availability of past retrieval results consisting of the queries that contain a particula...

Recovering Trace Links for SysML Models Using VSM-based Information Retrieval

Automated traceability recovery utilizing information retrieval techniques has been recognized as important for effective software development. In this paper, we discuss two approaches for augmenting the vector space model (VSM). The first approach employs document identifiers of a term, indicating where the term has been found, and a context-sensitive retrieval strategy that uses these identifi...

Publication date: 2010